Cache CLI extractor paths across Actions steps#3950
Conversation
Repeated calls to `resolveLanguages()` will only pay the performance penalty of executing `codeql resolve languages` once.
By wrapping `resolveLanguages()`, which is memoized, we can avoid executing `codeql resolve extractor` several times over the course of an analysis.
henrymercer
left a comment
There was a problem hiding this comment.
Caching these invocations makes a lot of sense! I have a high level comment and a couple of lower level comments.
The main point is that now that we're caching multiple invocations, it might be a good opportunity to generalise the design. For instance, you could imagine something like:
const versionCache = createPersistedCliCache({ envVar: EnvVar.CODEQL_VERSION_INFO, validate: isVersionInfo });
const resolveLanguagesCache = createPersistedCliCache({ envVar: EnvVar.CODEQL_RESOLVE_LANGUAGES, validate: isResolveLanguagesOutput });where createPersistedCliCache handles memoising in the Action and persisting between Actions steps with an environment variable.
Some smaller things:
- Ideally the cache entry would also depend on
getExtraOptionsFromEnv(["resolve","languages"]) - We should remove the cache in
testing-utils.tslike we do for the CodeQL version cache
| // This can be a bit slow due to the JVM startup cost. Instead, get | ||
| // the extractor path from resolveLanguages(), which caches its output. | ||
| const extractors = await this.resolveLanguages(); | ||
| return extractors[language][0]; |
There was a problem hiding this comment.
Does this correctly handle language aliases? We do currently normalise languages to their original names, but it would be good to at least document this change of behaviour if not resolve aliases.
Also, we should make sure that extractors[language] is defined and if not, error nicely.
mbg
left a comment
There was a problem hiding this comment.
I agree with @henrymercer's comments regarding a more generalised design for this. I am wondering about the use of environment variables here vs using a file on disk. I don't know if you have already considered this, but we store e.g. the Action configuration on disk as a file. Perhaps that would make sense for these cached CLI results as well.
A general point: could we also make sure to add doc comments for new top-level definitions before merging?
| CODEQL_VERSION_INFO = "CODEQL_ACTION_CLI_VERSION_INFO", | ||
|
|
||
| /** | ||
| * `ResolveLanguagesOutput` for the CodeQL CLI, so later Actions steps can reuse it instead of |
There was a problem hiding this comment.
Minor: ResolveLanguagesOutput is a type, but the environment variable presumably stores some representation of a value of that type. For example:
| * `ResolveLanguagesOutput` for the CodeQL CLI, so later Actions steps can reuse it instead of | |
| * Used to store a stringified JSON representation of a `ResolveLanguagesOutput` value for the CodeQL CLI, so later Actions steps can reuse it instead of |
| function isPersistedResolveLanguagesOutput( | ||
| value: unknown, | ||
| ): value is PersistedResolveLanguagesOutput { | ||
| return ( | ||
| typeof value === "object" && | ||
| value !== null && | ||
| typeof (value as Record<string, unknown>).cmd === "string" && | ||
| typeof (value as Record<string, unknown>).output === "object" && | ||
| (value as Record<string, unknown>).output !== null | ||
| ); | ||
| } |
There was a problem hiding this comment.
I know this mirrors what was done in #3943, but it would probably make sense to use the helper functions from the ./json module.
Similar to #3943, this PR caches the output of
codeql resolve languages, which contains the paths to the various extractors so that repeated calls toresolveLanguages()are idempotent. Additionally, re-implementresolveExtractor()as a wrapper overresolveLanguages()(to re-use the cached output) rather than shell out tocodeql resolve extractor.In one experiment, I counted seven instances of shelling out to
codeql resolve extractor. When you dig into the code, you can see why:resolveExtractor()is not called often or from many places; But one caller isisTracedLanguage(), which is wrapped byisScannedLanguage(). And these functions are often used in a loop/map over all/some languages. This can explain why we see consecutive executions ofcodeql resolve extractor.Risk assessment
For internal use only. Please select the risk level of this change:
Which use cases does this change impact?
Workflow types:
dynamicworkflows (Default Setup, Code Quality, ...).Products:
analysis-kinds: code-scanning.analysis-kinds: code-quality.upload-sarifaction.Environments:
github.comand/or GitHub Enterprise Cloud with Data Residency.How did/will you validate this change?
.test.tsfiles).pr-checks).If something goes wrong after this change is released, what are the mitigation and rollback strategies?
How will you know if something goes wrong after this change is released?
Are there any special considerations for merging or releasing this change?
Merge / deployment checklist